The goal of this exercise is to predict house prices in Ames, Iowa. I will use cross-validation with different data splits and feature selection, and compare a linear regression model against an SGD regressor to find the one with the better score and lower loss.
#upgrade plotly and install sweetviz
!pip install --upgrade plotly
!pip install sweetviz
Successfully installed plotly-5.19.0
Successfully installed sweetviz-2.3.1
# import numpy, matplotlib, etc.
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import sklearn
# sklearn imports
from sklearn import metrics
from sklearn import pipeline
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeavePOut
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
# load datasets
train = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
test = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")
submission = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv")
# Train dataset
train
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |
| 1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 147500 |
1460 rows × 81 columns
# show train info
train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 1460 non-null int64 1 MSSubClass 1460 non-null int64 2 MSZoning 1460 non-null object 3 LotFrontage 1201 non-null float64 4 LotArea 1460 non-null int64 5 Street 1460 non-null object 6 Alley 91 non-null object 7 LotShape 1460 non-null object 8 LandContour 1460 non-null object 9 Utilities 1460 non-null object 10 LotConfig 1460 non-null object 11 LandSlope 1460 non-null object 12 Neighborhood 1460 non-null object 13 Condition1 1460 non-null object 14 Condition2 1460 non-null object 15 BldgType 1460 non-null object 16 HouseStyle 1460 non-null object 17 OverallQual 1460 non-null int64 18 OverallCond 1460 non-null int64 19 YearBuilt 1460 non-null int64 20 YearRemodAdd 1460 non-null int64 21 RoofStyle 1460 non-null object 22 RoofMatl 1460 non-null object 23 Exterior1st 1460 non-null object 24 Exterior2nd 1460 non-null object 25 MasVnrType 588 non-null object 26 MasVnrArea 1452 non-null float64 27 ExterQual 1460 non-null object 28 ExterCond 1460 non-null object 29 Foundation 1460 non-null object 30 BsmtQual 1423 non-null object 31 BsmtCond 1423 non-null object 32 BsmtExposure 1422 non-null object 33 BsmtFinType1 1423 non-null object 34 BsmtFinSF1 1460 non-null int64 35 BsmtFinType2 1422 non-null object 36 BsmtFinSF2 1460 non-null int64 37 BsmtUnfSF 1460 non-null int64 38 TotalBsmtSF 1460 non-null int64 39 Heating 1460 non-null object 40 HeatingQC 1460 non-null object 41 CentralAir 1460 non-null object 42 Electrical 1459 non-null object 43 1stFlrSF 1460 non-null int64 44 2ndFlrSF 1460 non-null int64 45 LowQualFinSF 1460 non-null int64 46 GrLivArea 1460 non-null int64 47 BsmtFullBath 1460 non-null int64 48 BsmtHalfBath 1460 non-null int64 49 FullBath 1460 non-null int64 50 HalfBath 1460 non-null int64 51 BedroomAbvGr 1460 non-null int64 52 KitchenAbvGr 1460 non-null int64 53 KitchenQual 1460 non-null 
object 54 TotRmsAbvGrd 1460 non-null int64 55 Functional 1460 non-null object 56 Fireplaces 1460 non-null int64 57 FireplaceQu 770 non-null object 58 GarageType 1379 non-null object 59 GarageYrBlt 1379 non-null float64 60 GarageFinish 1379 non-null object 61 GarageCars 1460 non-null int64 62 GarageArea 1460 non-null int64 63 GarageQual 1379 non-null object 64 GarageCond 1379 non-null object 65 PavedDrive 1460 non-null object 66 WoodDeckSF 1460 non-null int64 67 OpenPorchSF 1460 non-null int64 68 EnclosedPorch 1460 non-null int64 69 3SsnPorch 1460 non-null int64 70 ScreenPorch 1460 non-null int64 71 PoolArea 1460 non-null int64 72 PoolQC 7 non-null object 73 Fence 281 non-null object 74 MiscFeature 54 non-null object 75 MiscVal 1460 non-null int64 76 MoSold 1460 non-null int64 77 YrSold 1460 non-null int64 78 SaleType 1460 non-null object 79 SaleCondition 1460 non-null object 80 SalePrice 1460 non-null int64 dtypes: float64(3), int64(35), object(43) memory usage: 924.0+ KB
#import sweetviz and show report on train
import sweetviz as sw
train_report = sw.analyze(train)
train_report.show_notebook(layout='vertical')
Find and fill empty cells
# count empty values in each column
def count_empty_values_in_each_column(df):
    print('empty values:')
    for column in df.columns:
        n_missing = df[column].isnull().sum()
        if n_missing > 0:
            print(f'`{column}`: {n_missing}')
count_empty_values_in_each_column(train)
count_empty_values_in_each_column(train)
empty values: `LotFrontage`: 259 `Alley`: 1369 `MasVnrType`: 872 `MasVnrArea`: 8 `BsmtQual`: 37 `BsmtCond`: 37 `BsmtExposure`: 38 `BsmtFinType1`: 37 `BsmtFinType2`: 38 `Electrical`: 1 `FireplaceQu`: 690 `GarageType`: 81 `GarageYrBlt`: 81 `GarageFinish`: 81 `GarageQual`: 81 `GarageCond`: 81 `PoolQC`: 1453 `Fence`: 1179 `MiscFeature`: 1406
# replace empty strings with np.nan
train.replace('', np.nan, inplace=True)
# detect np.NaN or None values in the copy of df
print(f'There are {len(np.where(train.isnull())[0])} empty values in the dataframe')
print(np.where(train.isnull()),'\n')
# count empty values in each column
def count_empty_values_in_each_column(df):
    print('empty values:')
    empty_columns = []
    for column in df.columns:
        n_missing = df[column].isnull().sum()
        if n_missing > 0:
            empty_columns.append(column)
            print(f'`{column}`: {n_missing} , type: {df[column].dtypes}')
    return empty_columns
empty_columns = count_empty_values_in_each_column(train)
There are 7829 empty values in the dataframe (array([ 0, 0, 0, ..., 1459, 1459, 1459]), array([ 6, 57, 72, ..., 72, 73, 74])) empty values: `LotFrontage`: 259 , type: float64 `Alley`: 1369 , type: object `MasVnrType`: 872 , type: object `MasVnrArea`: 8 , type: float64 `BsmtQual`: 37 , type: object `BsmtCond`: 37 , type: object `BsmtExposure`: 38 , type: object `BsmtFinType1`: 37 , type: object `BsmtFinType2`: 38 , type: object `Electrical`: 1 , type: object `FireplaceQu`: 690 , type: object `GarageType`: 81 , type: object `GarageYrBlt`: 81 , type: float64 `GarageFinish`: 81 , type: object `GarageQual`: 81 , type: object `GarageCond`: 81 , type: object `PoolQC`: 1453 , type: object `Fence`: 1179 , type: object `MiscFeature`: 1406 , type: object
Dropping features with a lot of missing values, as well as `Id`, which is unnecessary for prediction
#Dropping columns
columns_to_drop = ['Id', 'Alley', 'PoolQC', 'Fence', 'MiscFeature']
train.drop(columns_to_drop, axis=1, inplace=True)
# fill empty values in the dataframe
def fill_na_median(df, column_name):
    # fill numeric gaps with the median of the observed values
    df[column_name] = df[column_name].fillna(df[column_name].median())
def fill_na_random_pick_column_distribution(df, column_name):
    # fill each gap with a random draw from the column's observed distribution
    df_not_null = df[~df[column_name].isnull()]
    df[column_name] = df[column_name].apply(lambda x: np.random.choice(df_not_null[column_name]) if pd.isnull(x) else x)
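The two imputation strategies can be sketched on a toy frame (illustrative data, not from the Ames dataset): the numeric column gets its median, and the categorical column gets values sampled from its observed distribution, so the category frequencies stay roughly intact.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in a numeric and a categorical column (illustrative values).
df = pd.DataFrame({'area': [50.0, np.nan, 70.0, 90.0, np.nan],
                   'zone': ['RL', 'RM', None, 'RL', None]})

# Numeric: fill with the median of the observed values.
df['area'] = df['area'].fillna(df['area'].median())

# Categorical: draw each replacement from the observed value distribution.
rng = np.random.default_rng(0)
observed = df['zone'].dropna().to_numpy()
df['zone'] = df['zone'].apply(lambda x: rng.choice(observed) if pd.isnull(x) else x)

print(df.isnull().sum().sum())  # 0 missing values remain
```

Median filling keeps the numeric distribution's center stable, while random sampling avoids collapsing a categorical column onto its mode.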
# Show empty values and type
empty_columns = count_empty_values_in_each_column(train)
empty values: `LotFrontage`: 259 , type: float64 `MasVnrType`: 872 , type: object `MasVnrArea`: 8 , type: float64 `BsmtQual`: 37 , type: object `BsmtCond`: 37 , type: object `BsmtExposure`: 38 , type: object `BsmtFinType1`: 37 , type: object `BsmtFinType2`: 38 , type: object `Electrical`: 1 , type: object `FireplaceQu`: 690 , type: object `GarageType`: 81 , type: object `GarageYrBlt`: 81 , type: float64 `GarageFinish`: 81 , type: object `GarageQual`: 81 , type: object `GarageCond`: 81 , type: object
#fill empty values for each column
def fill_for_each_type(df, columns):
for column in columns:
if (df[column].dtypes == 'object'):
fill_na_random_pick_column_distribution(df,column)
else:
fill_na_median(df, column)
fill_for_each_type(train,empty_columns)
#Check for empty cells
count_empty_values_in_each_column(train)
empty values:
[]
There are no empty cells left in train; let's check the correlations
# show absolute correlation between features in a heatmap
plt.figure(figsize=(24,20))
cor = abs(train.select_dtypes(exclude=[object]).corr())
sns.heatmap(cor, annot=True, cmap=plt.cm.Blues, vmin=0, vmax=1)
plt.show()
Removing the features with low correlation to the target
# Check for low correlation with the target
low_corr=cor[cor['SalePrice'] < 0.1].index
low_corr
Index(['MSSubClass', 'OverallCond', 'BsmtFinSF2', 'LowQualFinSF',
'BsmtHalfBath', '3SsnPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold'],
dtype='object')
# Dropping the low correlation features
train.drop(low_corr, axis=1, inplace=True)
train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 66 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MSZoning 1460 non-null object 1 LotFrontage 1460 non-null float64 2 LotArea 1460 non-null int64 3 Street 1460 non-null object 4 LotShape 1460 non-null object 5 LandContour 1460 non-null object 6 Utilities 1460 non-null object 7 LotConfig 1460 non-null object 8 LandSlope 1460 non-null object 9 Neighborhood 1460 non-null object 10 Condition1 1460 non-null object 11 Condition2 1460 non-null object 12 BldgType 1460 non-null object 13 HouseStyle 1460 non-null object 14 OverallQual 1460 non-null int64 15 YearBuilt 1460 non-null int64 16 YearRemodAdd 1460 non-null int64 17 RoofStyle 1460 non-null object 18 RoofMatl 1460 non-null object 19 Exterior1st 1460 non-null object 20 Exterior2nd 1460 non-null object 21 MasVnrType 1460 non-null object 22 MasVnrArea 1460 non-null float64 23 ExterQual 1460 non-null object 24 ExterCond 1460 non-null object 25 Foundation 1460 non-null object 26 BsmtQual 1460 non-null object 27 BsmtCond 1460 non-null object 28 BsmtExposure 1460 non-null object 29 BsmtFinType1 1460 non-null object 30 BsmtFinSF1 1460 non-null int64 31 BsmtFinType2 1460 non-null object 32 BsmtUnfSF 1460 non-null int64 33 TotalBsmtSF 1460 non-null int64 34 Heating 1460 non-null object 35 HeatingQC 1460 non-null object 36 CentralAir 1460 non-null object 37 Electrical 1460 non-null object 38 1stFlrSF 1460 non-null int64 39 2ndFlrSF 1460 non-null int64 40 GrLivArea 1460 non-null int64 41 BsmtFullBath 1460 non-null int64 42 FullBath 1460 non-null int64 43 HalfBath 1460 non-null int64 44 BedroomAbvGr 1460 non-null int64 45 KitchenAbvGr 1460 non-null int64 46 KitchenQual 1460 non-null object 47 TotRmsAbvGrd 1460 non-null int64 48 Functional 1460 non-null object 49 Fireplaces 1460 non-null int64 50 FireplaceQu 1460 non-null object 51 GarageType 1460 non-null object 52 GarageYrBlt 1460 non-null float64 53 
GarageFinish 1460 non-null object 54 GarageCars 1460 non-null int64 55 GarageArea 1460 non-null int64 56 GarageQual 1460 non-null object 57 GarageCond 1460 non-null object 58 PavedDrive 1460 non-null object 59 WoodDeckSF 1460 non-null int64 60 OpenPorchSF 1460 non-null int64 61 EnclosedPorch 1460 non-null int64 62 ScreenPorch 1460 non-null int64 63 SaleType 1460 non-null object 64 SaleCondition 1460 non-null object 65 SalePrice 1460 non-null int64 dtypes: float64(3), int64(24), object(39) memory usage: 752.9+ KB
Let's remove the unnecessary features from test, as we did for train, and fill its empty cells
test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1459 entries, 0 to 1458 Data columns (total 80 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 1459 non-null int64 1 MSSubClass 1459 non-null int64 2 MSZoning 1455 non-null object 3 LotFrontage 1232 non-null float64 4 LotArea 1459 non-null int64 5 Street 1459 non-null object 6 Alley 107 non-null object 7 LotShape 1459 non-null object 8 LandContour 1459 non-null object 9 Utilities 1457 non-null object 10 LotConfig 1459 non-null object 11 LandSlope 1459 non-null object 12 Neighborhood 1459 non-null object 13 Condition1 1459 non-null object 14 Condition2 1459 non-null object 15 BldgType 1459 non-null object 16 HouseStyle 1459 non-null object 17 OverallQual 1459 non-null int64 18 OverallCond 1459 non-null int64 19 YearBuilt 1459 non-null int64 20 YearRemodAdd 1459 non-null int64 21 RoofStyle 1459 non-null object 22 RoofMatl 1459 non-null object 23 Exterior1st 1458 non-null object 24 Exterior2nd 1458 non-null object 25 MasVnrType 565 non-null object 26 MasVnrArea 1444 non-null float64 27 ExterQual 1459 non-null object 28 ExterCond 1459 non-null object 29 Foundation 1459 non-null object 30 BsmtQual 1415 non-null object 31 BsmtCond 1414 non-null object 32 BsmtExposure 1415 non-null object 33 BsmtFinType1 1417 non-null object 34 BsmtFinSF1 1458 non-null float64 35 BsmtFinType2 1417 non-null object 36 BsmtFinSF2 1458 non-null float64 37 BsmtUnfSF 1458 non-null float64 38 TotalBsmtSF 1458 non-null float64 39 Heating 1459 non-null object 40 HeatingQC 1459 non-null object 41 CentralAir 1459 non-null object 42 Electrical 1459 non-null object 43 1stFlrSF 1459 non-null int64 44 2ndFlrSF 1459 non-null int64 45 LowQualFinSF 1459 non-null int64 46 GrLivArea 1459 non-null int64 47 BsmtFullBath 1457 non-null float64 48 BsmtHalfBath 1457 non-null float64 49 FullBath 1459 non-null int64 50 HalfBath 1459 non-null int64 51 BedroomAbvGr 1459 non-null int64 52 KitchenAbvGr 1459 non-null int64 53 KitchenQual 1458 
non-null object 54 TotRmsAbvGrd 1459 non-null int64 55 Functional 1457 non-null object 56 Fireplaces 1459 non-null int64 57 FireplaceQu 729 non-null object 58 GarageType 1383 non-null object 59 GarageYrBlt 1381 non-null float64 60 GarageFinish 1381 non-null object 61 GarageCars 1458 non-null float64 62 GarageArea 1458 non-null float64 63 GarageQual 1381 non-null object 64 GarageCond 1381 non-null object 65 PavedDrive 1459 non-null object 66 WoodDeckSF 1459 non-null int64 67 OpenPorchSF 1459 non-null int64 68 EnclosedPorch 1459 non-null int64 69 3SsnPorch 1459 non-null int64 70 ScreenPorch 1459 non-null int64 71 PoolArea 1459 non-null int64 72 PoolQC 3 non-null object 73 Fence 290 non-null object 74 MiscFeature 51 non-null object 75 MiscVal 1459 non-null int64 76 MoSold 1459 non-null int64 77 YrSold 1459 non-null int64 78 SaleType 1458 non-null object 79 SaleCondition 1459 non-null object dtypes: float64(11), int64(26), object(43) memory usage: 912.0+ KB
#Dropping columns
test.drop(columns_to_drop, axis=1, inplace=True)
# Show empty values and type
empty_columns = count_empty_values_in_each_column(test)
empty values: `MSZoning`: 4 , type: object `LotFrontage`: 227 , type: float64 `Utilities`: 2 , type: object `Exterior1st`: 1 , type: object `Exterior2nd`: 1 , type: object `MasVnrType`: 894 , type: object `MasVnrArea`: 15 , type: float64 `BsmtQual`: 44 , type: object `BsmtCond`: 45 , type: object `BsmtExposure`: 44 , type: object `BsmtFinType1`: 42 , type: object `BsmtFinSF1`: 1 , type: float64 `BsmtFinType2`: 42 , type: object `BsmtFinSF2`: 1 , type: float64 `BsmtUnfSF`: 1 , type: float64 `TotalBsmtSF`: 1 , type: float64 `BsmtFullBath`: 2 , type: float64 `BsmtHalfBath`: 2 , type: float64 `KitchenQual`: 1 , type: object `Functional`: 2 , type: object `FireplaceQu`: 730 , type: object `GarageType`: 76 , type: object `GarageYrBlt`: 78 , type: float64 `GarageFinish`: 78 , type: object `GarageCars`: 1 , type: float64 `GarageArea`: 1 , type: float64 `GarageQual`: 78 , type: object `GarageCond`: 78 , type: object `SaleType`: 1 , type: object
#Fill the empty cells
fill_for_each_type(test,empty_columns)
#Check for empty cells
count_empty_values_in_each_column(test)
empty values:
[]
# Dropping low correlation features
test.drop(low_corr, axis=1, inplace=True)
test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1459 entries, 0 to 1458 Data columns (total 65 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MSZoning 1459 non-null object 1 LotFrontage 1459 non-null float64 2 LotArea 1459 non-null int64 3 Street 1459 non-null object 4 LotShape 1459 non-null object 5 LandContour 1459 non-null object 6 Utilities 1459 non-null object 7 LotConfig 1459 non-null object 8 LandSlope 1459 non-null object 9 Neighborhood 1459 non-null object 10 Condition1 1459 non-null object 11 Condition2 1459 non-null object 12 BldgType 1459 non-null object 13 HouseStyle 1459 non-null object 14 OverallQual 1459 non-null int64 15 YearBuilt 1459 non-null int64 16 YearRemodAdd 1459 non-null int64 17 RoofStyle 1459 non-null object 18 RoofMatl 1459 non-null object 19 Exterior1st 1459 non-null object 20 Exterior2nd 1459 non-null object 21 MasVnrType 1459 non-null object 22 MasVnrArea 1459 non-null float64 23 ExterQual 1459 non-null object 24 ExterCond 1459 non-null object 25 Foundation 1459 non-null object 26 BsmtQual 1459 non-null object 27 BsmtCond 1459 non-null object 28 BsmtExposure 1459 non-null object 29 BsmtFinType1 1459 non-null object 30 BsmtFinSF1 1459 non-null float64 31 BsmtFinType2 1459 non-null object 32 BsmtUnfSF 1459 non-null float64 33 TotalBsmtSF 1459 non-null float64 34 Heating 1459 non-null object 35 HeatingQC 1459 non-null object 36 CentralAir 1459 non-null object 37 Electrical 1459 non-null object 38 1stFlrSF 1459 non-null int64 39 2ndFlrSF 1459 non-null int64 40 GrLivArea 1459 non-null int64 41 BsmtFullBath 1459 non-null float64 42 FullBath 1459 non-null int64 43 HalfBath 1459 non-null int64 44 BedroomAbvGr 1459 non-null int64 45 KitchenAbvGr 1459 non-null int64 46 KitchenQual 1459 non-null object 47 TotRmsAbvGrd 1459 non-null int64 48 Functional 1459 non-null object 49 Fireplaces 1459 non-null int64 50 FireplaceQu 1459 non-null object 51 GarageType 1459 non-null object 52 GarageYrBlt 1459 non-null float64 53 
GarageFinish 1459 non-null object 54 GarageCars 1459 non-null float64 55 GarageArea 1459 non-null float64 56 GarageQual 1459 non-null object 57 GarageCond 1459 non-null object 58 PavedDrive 1459 non-null object 59 WoodDeckSF 1459 non-null int64 60 OpenPorchSF 1459 non-null int64 61 EnclosedPorch 1459 non-null int64 62 ScreenPorch 1459 non-null int64 63 SaleType 1459 non-null object 64 SaleCondition 1459 non-null object dtypes: float64(9), int64(17), object(39) memory usage: 741.0+ KB
# divide the data to features and target
t = train['SalePrice'].copy()
X = train.drop(['SalePrice'], axis=1)
print('t')
display(t)
print()
print('X')
display(X)
t
0 208500
1 181500
2 223500
3 140000
4 250000
...
1455 175000
1456 210000
1457 266500
1458 142125
1459 147500
Name: SalePrice, Length: 1460, dtype: int64
X
| MSZoning | LotFrontage | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | ... | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | ScreenPorch | SaleType | SaleCondition | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RL | 65.0 | 8450 | Pave | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | ... | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | WD | Normal |
| 1 | RL | 80.0 | 9600 | Pave | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | ... | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | WD | Normal |
| 2 | RL | 68.0 | 11250 | Pave | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | ... | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | WD | Normal |
| 3 | RL | 60.0 | 9550 | Pave | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | ... | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | WD | Abnorml |
| 4 | RL | 84.0 | 14260 | Pave | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | ... | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | WD | Normal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | RL | 62.0 | 7917 | Pave | Reg | Lvl | AllPub | Inside | Gtl | Gilbert | ... | 460 | TA | TA | Y | 0 | 40 | 0 | 0 | WD | Normal |
| 1456 | RL | 85.0 | 13175 | Pave | Reg | Lvl | AllPub | Inside | Gtl | NWAmes | ... | 500 | TA | TA | Y | 349 | 0 | 0 | 0 | WD | Normal |
| 1457 | RL | 66.0 | 9042 | Pave | Reg | Lvl | AllPub | Inside | Gtl | Crawfor | ... | 252 | TA | TA | Y | 0 | 60 | 0 | 0 | WD | Normal |
| 1458 | RL | 68.0 | 9717 | Pave | Reg | Lvl | AllPub | Inside | Gtl | NAmes | ... | 240 | TA | TA | Y | 366 | 0 | 112 | 0 | WD | Normal |
| 1459 | RL | 75.0 | 9937 | Pave | Reg | Lvl | AllPub | Inside | Gtl | Edwards | ... | 276 | TA | TA | Y | 736 | 68 | 0 | 0 | WD | Normal |
1460 rows × 65 columns
# find generator length
from tqdm.auto import tqdm
def find_generator_len(generator, use_pbar=True):
i = 0
if use_pbar:
pbar = tqdm(desc='Calculating Length',
ncols=1000,
bar_format='{desc}{bar:10}{r_bar}')
for a in generator:
i += 1
if use_pbar:
pbar.update()
if use_pbar:
pbar.close()
return i
Importing `KFold` and `LeavePOut` for cross-validation
from sklearn.model_selection import KFold
from sklearn.model_selection import LeavePOut
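As a quick sanity check on the two splitters (a small sketch on synthetic indices, not the house data): `KFold(k)` always yields `k` train/validation pairs, while `LeavePOut(p)` yields C(n, p) pairs, which grows combinatorially with the dataset size. Both expose `get_n_splits`, so the number of folds is available without exhausting the split generator.

```python
import numpy as np
from sklearn.model_selection import KFold, LeavePOut

X_small = np.arange(10).reshape(-1, 1)  # 10 toy samples

kf = KFold(n_splits=5, shuffle=True, random_state=42)
lpo = LeavePOut(p=2)

print(kf.get_n_splits(X_small))   # 5 folds
print(lpo.get_n_splits(X_small))  # C(10, 2) = 45 train/val pairs
```

On the full 1460-row train set, `LeavePOut(2)` would already produce over a million splits, which is why k-fold is the practical default here.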
# calculate score and loss from cv (KFold or LPO) and display graphs
def get_cv_score_and_loss(X, t, model, transformer=None,
k=None, p=None,
show_score_loss_graphs=False, use_pbar=True):
scores_losses_df = pd.DataFrame(columns=['fold_id', 'split', 'score', 'loss'])
if k is not None:
cv = KFold(n_splits=k, shuffle=True, random_state=42)
elif p is not None:
cv = LeavePOut(p)
else:
raise ValueError('you need to specify k or p in order for the cv to work')
if use_pbar:
pbar = tqdm(desc='Computing Models',
total=find_generator_len(cv.split(X)))
for i, (train_ids, val_ids) in enumerate(cv.split(X)):
X_train = X.iloc[train_ids]  # cv.split yields positional indices, so use iloc
t_train = t.iloc[train_ids]
X_val = X.iloc[val_ids]
t_val = t.iloc[val_ids]
model.fit(X_train, t_train)
y_train = model.predict(X_train)
y_val = model.predict(X_val)
scores_losses_df.loc[len(scores_losses_df)] =\
[i, 'train', model.score(X_train, t_train),
mean_squared_error(t_train, y_train)]
scores_losses_df.loc[len(scores_losses_df)] =\
[i, 'val', model.score(X_val, t_val), mean_squared_error(t_val, y_val)]
if use_pbar:
pbar.update()
if use_pbar:
pbar.close()
val_scores_losses_df = scores_losses_df[scores_losses_df['split']=='val']
train_scores_losses_df = scores_losses_df[scores_losses_df['split']=='train']
mean_val_score = val_scores_losses_df['score'].mean()
mean_val_loss = val_scores_losses_df['loss'].mean()
mean_train_score = train_scores_losses_df['score'].mean()
mean_train_loss = train_scores_losses_df['loss'].mean()
if show_score_loss_graphs:
fig = px.line(scores_losses_df, x='fold_id', y='score', color='split', title=f'Mean Val Score: {mean_val_score:.2f}, Mean Train Score: {mean_train_score:.2f}')
fig.show()
fig = px.line(scores_losses_df, x='fold_id', y='loss', color='split', title=f'Mean Val Loss: {mean_val_loss:.2f}, Mean Train Loss: {mean_train_loss:.2f}')
fig.show()
return mean_val_score, mean_val_loss,\
mean_train_score, mean_train_loss
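The per-fold bookkeeping above can be cross-checked against scikit-learn's built-in `cross_validate`, which returns train and validation scores per fold directly. This is a minimal sketch on synthetic regression data (not the house data), using negated MSE as the scorer:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression problem for illustration only.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
cv = KFold(n_splits=10, shuffle=True, random_state=42)

res = cross_validate(LinearRegression(), X_demo, y_demo, cv=cv,
                     scoring='neg_mean_squared_error', return_train_score=True)

# Scores are negated MSE, so flip the sign to report a loss.
print(f"mean val loss: {-res['test_score'].mean():.2f}")
print(f"mean train loss: {-res['train_score'].mean():.2f}")
```

The custom function is still useful here because it also produces the per-fold plotly graphs, which `cross_validate` does not.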
# choose the best 3 features of this dataset with SGDRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object', 'bool']).columns
all_cols = categorical_cols.tolist() + numerical_cols.tolist()
ct_enc_std = ColumnTransformer([
    ("encoding", OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), categorical_cols),
    ("standard", StandardScaler(), numerical_cols)])
X_encoded = pd.DataFrame(ct_enc_std.fit_transform(X, t), columns=all_cols)
# transform (not fit_transform) test so it reuses the encodings and scaling fitted on train
test_encoded = pd.DataFrame(ct_enc_std.transform(test), columns=all_cols)
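The fit-on-train, transform-on-test pattern matters here: refitting the transformer on test would assign different ordinal codes and different scaling statistics to the two sets. A minimal sketch with toy frames (illustrative values) shows the behavior, including how `handle_unknown='use_encoded_value'` deals with a category that never appears in train:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

train_demo = pd.DataFrame({'zone': ['RL', 'RM', 'RL'], 'area': [50.0, 70.0, 90.0]})
test_demo = pd.DataFrame({'zone': ['RM', 'FV'], 'area': [60.0, 80.0]})  # 'FV' unseen in train

ct = ColumnTransformer([
    ('encoding', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['zone']),
    ('standard', StandardScaler(), ['area'])])

ct.fit(train_demo)             # categories and scaling statistics come from train only
out = ct.transform(test_demo)  # unseen 'FV' maps to -1; 'area' uses train's mean/std
print(out)
```

Without `handle_unknown`, the `OrdinalEncoder` would raise on the unseen `'FV'` category instead of encoding it as `-1`.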
def print_score_and_lose(X, model):
    val_score, val_loss, train_score, train_loss = get_cv_score_and_loss(X, t, model, k=10, show_score_loss_graphs=True)
    print(f'mean cv val score: {val_score:.2f}\nmean cv val loss: {val_loss:.2f}')
    print(f'mean cv train score: {train_score:.2f}\nmean cv train loss: {train_loss:.2f}')
# find best subset of features on this dataset with SGDRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RepeatedKFold
selector_SGD = RFECV(
SGDRegressor(random_state=42),
cv=RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
).fit(X_encoded, t)
X_selected_features_sgd = X_encoded.loc[:, selector_SGD.support_]
display(X_selected_features_sgd)
| Street | LandContour | LandSlope | Condition2 | BldgType | RoofStyle | RoofMatl | MasVnrType | ExterQual | ExterCond | ... | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | GrLivArea | BsmtFullBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | GarageCars | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 3.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 | ... | -0.944591 | -0.459303 | -0.793434 | 1.161852 | 0.370333 | 1.107810 | 0.163779 | -0.211454 | 0.912210 | 0.311725 |
| 1 | 1.0 | 3.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 | ... | -0.641228 | 0.466465 | 0.257140 | -0.795163 | -0.482512 | -0.819964 | 0.163779 | -0.211454 | -0.318683 | 0.311725 |
| 2 | 1.0 | 3.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 | ... | -0.301643 | -0.313369 | -0.627826 | 1.189351 | 0.515013 | 1.107810 | 0.163779 | -0.211454 | -0.318683 | 0.311725 |
| 3 | 1.0 | 3.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 4.0 | ... | -0.061670 | -0.687324 | -0.521734 | 0.937276 | 0.383659 | 1.107810 | 0.163779 | -0.211454 | 0.296763 | 1.650307 |
| 4 | 1.0 | 3.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 | ... | -0.174865 | 0.199680 | -0.045611 | 1.617877 | 1.299326 | 1.107810 | 1.390023 | -0.211454 | 1.527656 | 1.650307 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1.0 | 3.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 4.0 | ... | 0.873321 | -0.238122 | -0.542435 | 0.795198 | 0.250402 | -0.819964 | 0.163779 | -0.211454 | 0.296763 | 0.311725 |
| 1456 | 1.0 | 3.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 | ... | 0.049262 | 1.104925 | 2.355701 | -0.795163 | 1.061367 | 1.107810 | 0.163779 | -0.211454 | 0.296763 | 0.311725 |
| 1457 | 1.0 | 3.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 2.0 | ... | 0.701265 | 0.215641 | 0.065656 | 1.844744 | 1.569647 | -0.819964 | 1.390023 | -0.211454 | 1.527656 | -1.026858 |
| 1458 | 1.0 | 3.0 | 0.0 | 2.0 | 0.0 | 3.0 | 1.0 | 1.0 | 3.0 | 4.0 | ... | -1.284176 | 0.046905 | -0.218982 | -0.795163 | -0.832788 | 1.107810 | -1.062465 | -0.211454 | -0.934130 | -1.026858 |
| 1459 | 1.0 | 3.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 | ... | -0.976285 | 0.452784 | 0.241615 | -0.795163 | -0.493934 | 1.107810 | 0.163779 | -0.211454 | -0.318683 | -1.026858 |
1460 rows × 26 columns
print('SGD Regressor')
print_score_and_lose(X_selected_features_sgd, selector_SGD)
SGD Regressor
mean cv val score: 0.77 mean cv val loss 1418892682.62
mean cv val score: 0.80 mean cv val loss 1235393797.39
# find best subset of features on this dataset with LinearRegression
selector_lr = RFECV(
LinearRegression(),
cv=RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
).fit(X_encoded, t)
X_selected_features_lr = X_encoded.loc[:, selector_lr.support_]
display(X_selected_features_lr)
| MSZoning | Street | LotShape | LandContour | Utilities | LandSlope | Condition1 | Condition2 | BldgType | HouseStyle | ... | BsmtFullBath | FullBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | GarageYrBlt | GarageCars | WoodDeckSF | ScreenPorch | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 1.0 | 3.0 | 3.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 5.0 | ... | 1.107810 | 0.789741 | 0.163779 | -0.211454 | 0.912210 | -0.951226 | 1.017598 | 0.311725 | -0.752176 | -0.270208 |
| 1 | 3.0 | 1.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 2.0 | ... | -0.819964 | 0.789741 | 0.163779 | -0.211454 | -0.318683 | 0.600495 | -0.107927 | 0.311725 | 1.626195 | -0.270208 |
| 2 | 3.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 5.0 | ... | 1.107810 | 0.789741 | 0.163779 | -0.211454 | -0.318683 | 0.600495 | 0.934226 | 0.311725 | -0.752176 | -0.270208 |
| 3 | 3.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 5.0 | ... | 1.107810 | -1.026041 | 0.163779 | -0.211454 | 0.296763 | 0.600495 | 0.809167 | 1.650307 | -0.752176 | -0.270208 |
| 4 | 3.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 5.0 | ... | 1.107810 | 0.789741 | 1.390023 | -0.211454 | 1.527656 | 0.600495 | 0.892540 | 1.650307 | 0.780197 | -0.270208 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 3.0 | 1.0 | 3.0 | 3.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 5.0 | ... | -0.819964 | 0.789741 | 0.163779 | -0.211454 | 0.296763 | 0.600495 | 0.850854 | 0.311725 | -0.752176 | -0.270208 |
| 1456 | 3.0 | 1.0 | 3.0 | 3.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 2.0 | ... | 1.107810 | 0.789741 | 0.163779 | -0.211454 | 0.296763 | 2.152216 | -0.024555 | 0.311725 | 2.033231 | -0.270208 |
| 1457 | 3.0 | 1.0 | 3.0 | 3.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 5.0 | ... | -0.819964 | 0.789741 | 1.390023 | -0.211454 | 1.527656 | 2.152216 | -1.566941 | -1.026858 | -0.752176 | -0.270208 |
| 1458 | 3.0 | 1.0 | 3.0 | 3.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 2.0 | ... | 1.107810 | -1.026041 | -1.062465 | -0.211454 | -0.934130 | -0.951226 | -1.191766 | -1.026858 | 2.168910 | -0.270208 |
| 1459 | 3.0 | 1.0 | 3.0 | 3.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 2.0 | ... | 1.107810 | -1.026041 | 0.163779 | -0.211454 | -0.318683 | -0.951226 | -0.566474 | -1.026858 | 5.121921 | -0.270208 |
1460 rows × 46 columns
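Beyond the selected-feature table above, RFECV also exposes which features were kept and in what order the rest were eliminated. A minimal sketch of how to inspect that (run here on a small synthetic dataset so it is self-contained; in the notebook the same attributes exist on `selector_lr`):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold

# Synthetic stand-in for X_encoded / t: 10 features, only 5 informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=42)

selector = RFECV(
    LinearRegression(),
    cv=RepeatedKFold(n_splits=5, n_repeats=2, random_state=42),
).fit(X, y)

# support_ is a boolean mask over the columns; ranking_ gives 1 for
# selected features and higher numbers for features eliminated earlier
print('n selected:', selector.n_features_)
print('mask:', selector.support_)
print('ranking:', selector.ranking_)
```

`X_encoded.loc[:, selector.support_]` is then exactly the column subset displayed in the tables above.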
Printing graphs of the score and loss for each fold
print('Linear Regression')
print_score_and_lose(X_selected_features_lr, selector_lr)
Linear Regression
mean cv val score: 0.80 mean cv val loss 1216405828.55
mean cv val score: 0.84 mean cv val loss 978601590.91
As we can see, both the validation loss and the score are better with linear regression. The best split count is 9.
Let's check whether the SGD model improves after regularization.
# train with grid search and get best parameters
from sklearn.model_selection import GridSearchCV
hyper_parameters = {'penalty': ('l2', 'l1', 'elasticnet'), 'alpha': [0.0001, 0.001, 0.01, 0.1], 'eta0': [0.001, 0.01, 0.1, 0.5]}
gs_model = GridSearchCV(SGDRegressor(random_state=42), hyper_parameters).fit(X_selected_features_sgd, t)
print('best params', gs_model.best_params_)
/opt/conda/lib/python3.10/site-packages/sklearn/linear_model/_stochastic_gradient.py:1548: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
best params {'alpha': 0.1, 'eta0': 0.001, 'penalty': 'elasticnet'}
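The repeated ConvergenceWarning above means some parameter combinations did not converge within SGDRegressor's default `max_iter=1000`. One way to address it, sketched below on synthetic data (raising `max_iter` and tightening `tol` is my suggestion here, not part of the original notebook):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)  # SGD is sensitive to feature scale

# More epochs give the optimizer room to satisfy the tol stopping criterion
sgd = SGDRegressor(max_iter=10000, tol=1e-4, random_state=42).fit(X, y)
print('iterations used:', sgd.n_iter_)
print('train R^2: %.3f' % sgd.score(X, y))
```

If `n_iter_` comes back well below `max_iter`, the model stopped early because it converged, and the warning disappears.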
# train with random search and get best parameters
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
np.random.seed(1)
distributions = dict(alpha=uniform(loc=0, scale=1), penalty=['l2', 'l1', 'elasticnet'], eta0=[0.001, 0.01, 0.1, 0.5])
rs_model = RandomizedSearchCV(SGDRegressor(), distributions, random_state=42).fit(X_selected_features_sgd, t)
print('best params', rs_model.best_params_)
/opt/conda/lib/python3.10/site-packages/sklearn/linear_model/_stochastic_gradient.py:1548: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
best params {'alpha': 0.6174815096277165, 'eta0': 0.01, 'penalty': 'l1'}
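Since useful `alpha` values span several orders of magnitude, sampling it log-uniformly is often more effective than the `uniform(loc=0, scale=1)` distribution used above. A sketch of that variant (`loguniform` needs scipy >= 1.4; the grid bounds here are illustrative):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

# loguniform samples evenly in log space, so 1e-4 is as likely as 1e-1
distributions = dict(
    alpha=loguniform(1e-5, 1e0),
    penalty=['l2', 'l1', 'elasticnet'],
    eta0=[0.001, 0.01, 0.1, 0.5],
)
search = RandomizedSearchCV(
    SGDRegressor(max_iter=5000, random_state=42),
    distributions, n_iter=20, random_state=42,
).fit(X, y)
print('best params:', search.best_params_)
```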
print('GridSearchCV')
print_score_and_lose(X_selected_features_sgd, gs_model)
print('RandomizedSearchCV')
print_score_and_lose(X_selected_features_sgd, rs_model)
GridSearchCV
mean cv val score: 0.77 mean cv val loss 1412047846.12
mean cv val score: 0.80 mean cv val loss 1232345259.94
RandomizedSearchCV
mean cv val score: 0.78 mean cv val loss 1375987026.28
mean cv val score: 0.81 mean cv val loss 1223520284.66
As we can see, regularization did not improve the SGD score beyond plain linear regression.
To predict the test set we will use the linear model.
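The implementation of `print_score_and_lose` is outside this excerpt, but its per-fold readout (mean validation R² score and mean MSE loss) can be reproduced with `cross_validate`; a sketch under that assumption, on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_validate

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
res = cross_validate(LinearRegression(), X, y, cv=cv,
                     scoring=('r2', 'neg_mean_squared_error'))

# Mirror the notebook's readout: mean validation R^2 and mean MSE loss
print('mean cv val score: %.2f' % res['test_r2'].mean())
print('mean cv val loss %.2f' % -res['test_neg_mean_squared_error'].mean())
```

Note that scikit-learn reports the negated MSE (larger is better), so the sign is flipped before printing it as a loss.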
# Model for prediction with 9 splits
lr_model = RFECV(
LinearRegression(),
cv=RepeatedKFold(n_splits=9, n_repeats=10, random_state=42)
).fit(X_encoded, t)
submission['SalePrice'] = lr_model.predict(test_encoded)
submission.to_csv('submission.csv', index=False)
display(submission)
| Id | SalePrice | |
|---|---|---|
| 0 | 1461 | 110263.050342 |
| 1 | 1462 | 171099.624741 |
| 2 | 1463 | 174575.361873 |
| 3 | 1464 | 190693.752952 |
| 4 | 1465 | 185638.988891 |
| ... | ... | ... |
| 1454 | 2915 | 56860.195649 |
| 1455 | 2916 | 54328.013466 |
| 1456 | 2917 | 139090.008993 |
| 1457 | 2918 | 113509.734548 |
| 1458 | 2919 | 244608.846118 |
1459 rows × 2 columns
First I checked for empty cells and filled them with random/median values. Then I used sweetviz and autoviz to explore the data and its correlations. I removed unnecessary features, then analyzed the correlations with a heatmap and dropped all low-correlation features (absolute correlation below 0.1). Using cross validation I compared different models and decided which performed better and which number of folds to choose. I tried regularization techniques on the SGD model to improve its score, but they did not help. The model chosen for the prediction is linear regression.
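The missing-value step described above can be sketched with scikit-learn's SimpleImputer (the median strategy is the one mentioned; the two columns here are from the Ames dataset, but the values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'LotFrontage': [65.0, np.nan, 80.0, np.nan],
                   'GarageArea': [548.0, 460.0, np.nan, 608.0]})

# Fill each numeric column's NaNs with that column's median
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```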
sklearn - https://scikit-learn.org/stable/
Afeka course notebooks